
Site Scanning & IP grabs




[You ("Brian W. Spolarich")]
>  Good spiders will ask for /robots.txt and find out what to do with 
>themselves if they find it.
>
>  Generally grepping for /robots.txt will give you a list of spiders that 
>have found you.

Very true.  In fact, on my server I've ScriptAliased /robots.txt to
the following little Perl script.  This lets me log a little more information
about the robot than the server records by default, namely the
address advertised in HTTP_FROM.

--------------------------code snippet
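#!/usr/bin/perl
# Serve a permissive robots.txt and log who asked for it.
use strict;
use warnings;

my $Log = '/var/adm/httpd_robots';
my @Interesting = ('HTTP_USER_AGENT', 'REMOTE_ADDR', 'REMOTE_HOST', 'HTTP_FROM');

# Send the response first so the robot gets its answer even if logging fails.
print "Content-type: text/plain\n\n";
print "User-agent: *\nDisallow:\n\n";

open(my $log, '>>', $Log) || die("Can't open $Log: $!\n");

# One line per request: [timestamp], then tab-separated NAME=value pairs.
print {$log} '[' . localtime() . ']';

foreach my $env (@Interesting) {
   # Variables the server didn't set (e.g. no From: header) are logged empty.
   my $value = defined $ENV{$env} ? $ENV{$env} : '';
   print {$log} "\t$env=$value";
}
print {$log} "\n";
close $log;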
--------------------------end code snippet

Some of the lines produced by this (I've wrapped long lines, marking each
break with '\'):

[Thu Feb  8 00:44:51 1996]      HTTP_USER_AGENT=Scoutget 1.0    REMOTE_ADDR=206.\
101.96.35       REMOTE_HOST=seventeen.srv.lycos.com     HTTP_FROM=
[Thu Feb  8 01:48:31 1996]      HTTP_USER_AGENT=OTI_Spider/OTWR:002p116  libwww/\
2.17    REMOTE_ADDR=205.216.146.179     REMOTE_HOST=205.216.146.179     HTTP_FRO\
M=gregf@opentext.com
[Thu Feb  8 15:29:17 1996]      HTTP_USER_AGENT=OTI_Spider/OTWR:002p116  libwww/\
2.17    REMOTE_ADDR=205.216.146.179     REMOTE_HOST=dialup-a.mv.opentext.com\
HTTP_FROM=gregf@opentext.com
[Sun Feb 11 03:00:29 1996]      HTTP_USER_AGENT=CERN-LineMode/2.15  libwww/2.17\
REMOTE_ADDR=199.107.235.42      REMOTE_HOST=199.107.235.42      HTTP_FROM=vic@ap\
ollo.alphaspace.com  

Interestingly, it seems that the Lycos robot doesn't send a From: header,
so HTTP_FROM comes through empty.  Odd.
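
If you want to pick out the robots that don't identify themselves, something
along these lines should do it.  This is only a rough sketch: it assumes the
log format written by the script above (a bracketed timestamp followed by
tab-separated NAME=value pairs, one request per line, unwrapped), and the
names parse_robot_line and anonymous_robots are just ones I made up for the
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one log line of the form
#   [timestamp]<TAB>NAME=value<TAB>NAME=value...
# Returns a hash reference, or nothing if the line doesn't look right.
sub parse_robot_line {
    my ($line) = @_;
    chomp $line;
    my ($stamp, @pairs) = split /\t/, $line, -1;
    my ($time) = defined $stamp ? $stamp =~ /^\[(.*)\]$/ : ();
    return unless defined $time;

    my %rec = (time => $time);
    foreach my $pair (@pairs) {
        my ($name, $value) = split /=/, $pair, 2;
        $rec{$name} = defined $value ? $value : '';
    }
    return \%rec;
}

# Count requests per host for robots that sent no From: address.
sub anonymous_robots {
    my (@lines) = @_;
    my %seen;
    foreach my $line (@lines) {
        my $rec = parse_robot_line($line) or next;
        next if defined $rec->{HTTP_FROM} && length $rec->{HTTP_FROM};
        my $host = $rec->{REMOTE_HOST} || $rec->{REMOTE_ADDR} || 'unknown';
        $seen{$host}++;
    }
    return \%seen;
}

# Two of the entries shown above, with the wrapping undone.
my @sample = (
    "[Thu Feb  8 00:44:51 1996]\tHTTP_USER_AGENT=Scoutget 1.0"
      . "\tREMOTE_ADDR=206.101.96.35\tREMOTE_HOST=seventeen.srv.lycos.com"
      . "\tHTTP_FROM=\n",
    "[Thu Feb  8 01:48:31 1996]\tHTTP_USER_AGENT=OTI_Spider/OTWR:002p116  libwww/2.17"
      . "\tREMOTE_ADDR=205.216.146.179\tREMOTE_HOST=205.216.146.179"
      . "\tHTTP_FROM=gregf\@opentext.com\n",
);

my $counts = anonymous_robots(@sample);
print "$_ ($counts->{$_})\n" for sort keys %$counts;
# prints: seventeen.srv.lycos.com (1)
```

To run it against the real log, you'd feed the lines of /var/adm/httpd_robots
to anonymous_robots() instead of the sample list.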

.....A. P. Harris...apharris@onShore.com...<URL:http://www.onShore.com/>

